In this paper, we develop a new approach to visual object tracking based on spatially supervised recurrent convolutional neural networks. Our recurrent convolutional network exploits the history of locations as well as the distinctive visual features learned by deep neural networks. Inspired by recent bounding box regression methods for object detection, we study the regression capability of Long Short-Term Memory (LSTM) in the temporal domain, and propose to concatenate high-level visual features produced by convolutional networks with region information. In contrast to existing deep-learning-based trackers that use binary classification for region candidates, we use regression for direct prediction of the tracking locations both at the convolutional layer and at the recurrent unit. Our extensive experimental results and performance comparison with state-of-the-art tracking methods on challenging benchmark video tracking datasets show that our tracker is more accurate and robust while maintaining low computational cost. For most test video sequences, our method achieves the best tracking performance, often outperforming the second-best method by a large margin.
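The idea of concatenating visual features with region information and regressing the location with an LSTM can be illustrated with a minimal sketch. This is not our implementation: the class `TinyLSTMRegressor`, the function `track`, the dimensions, and the randomly initialized (untrained) weights are all assumptions chosen for illustration, and the per-frame features and boxes stand in for the output of a convolutional network.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMRegressor:
    """Hypothetical LSTM cell with a linear head that outputs a box (x, y, w, h)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Input to every gate: visual features + 4 box values + previous hidden state.
        in_dim = feat_dim + 4 + hidden_dim
        self.hidden_dim = hidden_dim
        self.W = {g: rand_matrix(hidden_dim, in_dim, rng) for g in self.GATES}
        self.b = {g: [0.0] * hidden_dim for g in self.GATES}
        self.W_out = rand_matrix(4, hidden_dim, rng)  # regression head

    def step(self, features, box, h, c):
        # Concatenate visual features with region information and the hidden state.
        x = list(features) + list(box) + list(h)
        pre = {
            g: [a + b for a, b in zip(matvec(self.W[g], x), self.b[g])]
            for g in self.GATES
        }
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        # Standard LSTM state update.
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        # Direct regression of the location instead of classifying candidates.
        return matvec(self.W_out, h_new), h_new, c_new


def track(model, frames):
    """Run the regressor over (features, box) pairs, one pair per frame."""
    h = [0.0] * model.hidden_dim
    c = [0.0] * model.hidden_dim
    predictions = []
    for features, box in frames:
        pred, h, c = model.step(features, box, h, c)
        predictions.append(pred)
    return predictions


if __name__ == "__main__":
    rng = random.Random(1)
    # Placeholder inputs: 8-dimensional features and a normalized box per frame.
    frames = [([rng.random() for _ in range(8)], [0.5, 0.5, 0.2, 0.3]) for _ in range(3)]
    for pred in track(TinyLSTMRegressor(feat_dim=8, hidden_dim=5), frames):
        print([round(v, 4) for v in pred])
```

Because the hidden and cell states are carried from frame to frame, each predicted box depends on the history of locations as well as on the current visual features. With untrained weights the outputs are meaningless; the sketch only shows the data flow.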